Abstract: Many network information theory problems face a similar difficulty: single-letterization. We argue that this is due to the lack of a geometric structure on the space of probability distributions. In this paper, we develop such a structure by assuming that the distributions of interest are close to each other. Under this assumption, the K-L divergence reduces to the squared Euclidean metric in a Euclidean space. Moreover, we construct notions of coordinates and inner product, which facilitate solving communication problems. We also present applications of this approach to the point-to-point channel and the general broadcast channel, which demonstrate how our technique simplifies information theory problems.

I. Introduction 

Since Shannon introduced the notion of capacity sixty years ago, finding the capacity of channels and networks has been a core problem in information theory. The analysis of capacity answers the question of how many bits can be transmitted through a communication network, and also provides many insights into engineering problems [1]. However, for general problems, there is no systematic way to obtain optimal single-letter solutions. By cleverly picking auxiliary random variables, we can sometimes prove that the constructed single-letter solutions are optimal for some problems. But when this fails, we cannot tell whether it is because we have not tried hard enough or because the problem itself does not have an optimal single-letter solution.

The difficulty of obtaining optimal single-letter solutions comes from the fact that most information theoretic quantities, such as entropy, mutual information, and error exponents, are special cases of the Kullback-Leibler (K-L) divergence. The K-L divergence is a measure of distance between two probability distributions. However, in multi-terminal communication problems, there are multiple input and output distributions, and we usually need to deal with problems in high dimensional probability spaces. In these cases, describing problems with only a distance measure is cumbersome, and solving these information theory problems turns out to be extremely hard even with numerical aids. Therefore, we need more geometric structure, such as inner products, to describe the problems. This is however difficult, as the K-L divergence between two distributions, $D(P\|Q)$, is not symmetric between $P$ and $Q$, and is in general not a valid metric. Thus, when the K-L divergence serves as the distance measure, the space of probability distributions is not a linear vector space but a manifold [2].



In this paper, we present an approach [3] that simplifies these problems under the assumption that the distributions of interest are close to each other. With this assumption, the manifold formed by distributions can be approximated by a tangent plane, which is a Euclidean space. Moreover, the K-L divergence behaves as the squared Euclidean metric between distributions in this Euclidean space. Therefore, we obtain the notions of coordinates, inner product, and orthogonality in a linear metric space, with which to describe information theory problems. Moreover, we demonstrate in the remaining sections that the linear structure constructed from our local approximation turns information theory problems into linear algebra problems, which can be solved easily. In particular, we show that a systematic approach can be used to solve single-letterization problems. We apply this to the general broadcast channel, and obtain new insights into the optimality of the existing solutions [4]-[7].

The rest of this paper is organized as follows. We introduce the notion of local approximation in Section II and show that the K-L divergence can be approximated as the squared Euclidean metric. In Sections III and IV, we present the application of our local approximation to the point-to-point channel and the general broadcast channel, respectively. We illustrate how information theory problems become simple linear algebra problems when our technique is applied.

II. The Local Approximation 

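The key step of our approach is to use a local approximation of the Kullback-Leibler (K-L) divergence. Let $P$ and $Q$ be two distributions over the same alphabet $\mathcal{X}$. We assume that $Q(x) = P(x) + \epsilon J(x)$ for some small value $\epsilon$. Then the K-L divergence can be written, with a second order Taylor expansion, as

$$D(P\|Q) = -\sum_x P(x) \log \frac{Q(x)}{P(x)} = -\sum_x P(x) \log\left(1 + \epsilon \frac{J(x)}{P(x)}\right) = \frac{1}{2}\epsilon^2 \sum_x \frac{J^2(x)}{P(x)} + o(\epsilon^2).$$

The first-order term vanishes because $\sum_x J(x) = 0$, as both $P$ and $Q$ sum to one.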

We denote $\sum_x J^2(x)/P(x)$ as $\|J\|_P^2$, which is the weighted squared norm of the perturbation vector $J$. It is easy to verify that replacing the weight in this norm by $Q$ only results in an $o(\epsilon^2)$ difference. That is, up to the first order approximation, the weights in the norm simply indicate the neighborhood of distributions where the divergence is computed. As a consequence, $D(P\|Q)$ and $D(Q\|P)$ are considered equal up to the first order approximation.

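For convenience, we define the weighted perturbation vector

$$L(x) \triangleq \frac{1}{\sqrt{P(x)}} J(x), \quad \forall x \in \mathcal{X},$$

or in vector form $L \triangleq [\sqrt{P}^{-1}] J$, where $[\sqrt{P}^{-1}]$ represents the diagonal matrix with entries $\{1/\sqrt{P(x)},\ x \in \mathcal{X}\}$. This allows us to write $\|J\|_P^2 = \|L\|^2$, where the last norm is simply the Euclidean norm.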

With this definition of the norm on the perturbations of distributions, we can generalize to define the corresponding notion of inner products. Let $Q_i(x) = P(x) + \epsilon \cdot J_i(x)$, $\forall x$, $i = 1, 2$. We can define

$$\langle J_1, J_2 \rangle_P \triangleq \sum_x \frac{1}{P(x)} J_1(x) J_2(x) = \langle L_1, L_2 \rangle,$$

where $L_i = [\sqrt{P}^{-1}] J_i$, for $i = 1, 2$. From this, notions of orthogonal perturbations and projections can be similarly defined. The point here is that we can view a neighborhood of distributions as a linear metric space, and define notions of orthonormal basis and coordinates on it.
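As an illustration of the local approximation, the following minimal Python sketch compares the exact K-L divergence with the quadratic form $\frac{1}{2}\epsilon^2 \|J\|_P^2$ and checks that the weighted inner product of two perturbations equals the Euclidean inner product of their weighted versions. The distribution `P`, the directions `J1` and `J2`, and all function names are assumptions chosen only for this example; they do not come from the paper.

```python
import numpy as np

def kl(p, q):
    """K-L divergence D(p||q) in nats; assumes strictly positive entries."""
    return float(np.sum(p * np.log(p / q)))

def weighted_norm_sq(J, P):
    """Weighted squared norm ||J||_P^2 = sum_x J(x)^2 / P(x)."""
    return float(np.sum(J ** 2 / P))

# Hypothetical reference distribution and two zero-sum perturbation directions.
P = np.array([0.5, 0.3, 0.2])
J1 = np.array([1.0, -1.0, 0.0])
J2 = np.array([1.0, 1.0, -2.0])

def approx_ratio(eps):
    """Exact D(P||Q) divided by its local approximation, for Q = P + eps*J1."""
    Q = P + eps * J1
    return kl(P, Q) / (0.5 * eps ** 2 * weighted_norm_sq(J1, P))

def asymmetry_ratio(eps):
    """D(P||Q) / D(Q||P) for Q = P + eps*J1."""
    Q = P + eps * J1
    return kl(P, Q) / kl(Q, P)

# Weighted inner product of J1, J2 versus Euclidean inner product of L1, L2.
L1, L2 = J1 / np.sqrt(P), J2 / np.sqrt(P)
inner_weighted = float(np.sum(J1 * J2 / P))
inner_euclidean = float(L1 @ L2)

for eps in (1e-1, 1e-2, 1e-3):
    print(eps, approx_ratio(eps), asymmetry_ratio(eps))
print(inner_weighted, inner_euclidean)
```

Both printed ratios should approach 1 as `eps` shrinks, which is the sense in which the divergence is locally quadratic and symmetric.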

III. The Point-to-Point Channel

We start by using this local geometric structure to study the point-to-point channel, to demonstrate the new insights we can obtain from this approach even on a well-understood problem. It is well known that the capacity problem is

$$\max_{P_X} I(X;Y), \tag{1}$$

but this is in fact a single-letter solution of the coding problem

$$\max \frac{1}{n} I(U; Y^n), \tag{2}$$

for some discrete random variable $U$, such that $U \to X^n \to Y^n$ forms a Markov chain.

Now, to apply the local approximations, instead of solving (2), we study a slightly different problem

$$\max_{U \to X^n \to Y^n:\ \frac{1}{n} I(U;X^n) \le \frac{1}{2}\epsilon^2} \ \frac{1}{n} I(U; Y^n). \tag{3}$$

We call problem (3) the linear information coupling problem. The only difference between (3) and (2) lies in the constraint $\frac{1}{n} I(U;X^n) \le \frac{1}{2}\epsilon^2$ in (3). That is, instead of trying to find how many bits in total we can send through the given channel, we ask how efficiently we can send a thin layer of information through this channel. One advantage of (3) is that it allows easy single-letterization, as we will demonstrate in the following. In fact, the step of single-letterization, namely going from (2) to (1), is the difficult step in most network problems. For these problems, the approach used for the point-to-point problem cannot be applied. What we will show in the rest of this section is that there is an alternative approach to the well-known steps that go from (2) to (1), and that this new approach, based on the geometric structure, can be applied to more general problems. For simplicity, in this paper we assume that the marginal distribution $P_{X^n}$ is given, and is an i.i.d. distribution over the $n$ letters (see footnote 1), so that we can focus on finding $U$ and the conditional distribution $P_{X^n|U}$ optimizing (3).

First, we solve the single-letter version, namely $n = 1$, of this problem. Observe that we can write the constraint as

$$I(U;X) = \sum_u P_U(u) \cdot D(P_{X|U}(\cdot|u) \,\|\, P_X) \le \frac{1}{2}\epsilon^2.$$

This implies that for each value of $u$, the conditional distribution $P_{X|U=u}$ is a local perturbation from $P_X$, that is, $P_{X|U=u} = P_X + \epsilon \cdot J_u$.

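Next, using the notation $L_u = [\sqrt{P_X}^{-1}] J_u$ for each value of $u$, we observe that

$$\begin{aligned} P_{Y|U=u} &= W P_{X|U=u} \\ &= W P_X + \epsilon \cdot W J_u \\ &= P_Y + \epsilon \cdot W [\sqrt{P_X}] L_u, \end{aligned}$$

where the channel applied to an input distribution is simply viewed as the channel matrix $W$, of dimension $|\mathcal{Y}| \times |\mathcal{X}|$, multiplying the input distribution as a vector. At this point, we have reduced both the spaces of input and output distributions to linear spaces, and the channel acts as a linear transform between these two spaces. Ignoring the $o(\epsilon^2)$ terms, the information coupling problem can be rewritten as

$$\begin{aligned} \max \quad & \sum_u P_U(u) \cdot \|W J_u\|_{P_Y}^2 \\ \text{subject to:} \quad & \sum_u P_U(u) \cdot \|J_u\|_{P_X}^2 = 1, \end{aligned}$$

or equivalently, in terms of Euclidean norms,

$$\begin{aligned} \max \quad & \sum_u P_U(u) \cdot \left\| [\sqrt{P_Y}^{-1}]\, W\, [\sqrt{P_X}] \cdot L_u \right\|^2 \\ \text{subject to:} \quad & \sum_u P_U(u) \cdot \|L_u\|^2 = 1. \end{aligned}$$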



This linear algebra problem is simple. We need to find the joint distribution $U \to X \to Y$ by specifying $P_U$ and the perturbations $J_u$ for each value of $u$, such that the marginal constraint on $P_X$ is met, and such that these perturbations are the most visible at the $Y$ end, in the sense that, multiplied by the channel matrix, the $W J_u$'s have large norms. This can be readily solved by setting the weighted perturbation vectors $L_u$ to be along the input (right) singular vectors of the matrix $B \triangleq [\sqrt{P_Y}^{-1}]\, W\, [\sqrt{P_X}]$ with large singular values. Moreover, the choice of $P_U$ has no effect on the optimization, and might be taken as binary uniform for simplicity. This is illustrated in Figure 1(a).

Fig. 1. (a) Choice of $P_U$ and $P_{X|U}$ to maintain the marginal $P_X$. (b) Divergence Transition Map as a linear map between two spaces, with right and left singular vectors as orthonormal bases.

Footnote 1: This assumption can be proved to be "without loss of optimality" for some cases. In general, it requires a separate optimization, which is not the main issue addressed in this paper. To that end, we also assume that the given marginal $P_{X^n}$ has strictly positive entries.



We call the matrix $B$ the divergence transition matrix (DTM). It maps divergence in the space of input distributions to that of the output distributions. The singular value decomposition (SVD) structure of this linear map plays a critical role in our analysis. It can be shown that the largest singular value of $B$ is 1, corresponding to the input singular vector $[\sqrt{P_X(x)},\ x \in \mathcal{X}]^T$, which is orthogonal to the simplex of probability distributions. This is not a valid choice for perturbation vectors. However, all vectors orthogonal to this vector, or equivalently, all linear combinations of the other singular vectors, are valid choices of the perturbation vectors $L_u$. Thus, the optimum of the above problem is achieved by setting $L_u$ to be along the singular vector with the second largest singular value.
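The construction of the DTM and its SVD can be sketched numerically as follows. This is only an illustrative sketch: the channel matrix `W` and input marginal `P_X` are hypothetical values chosen for the example, and the variable names are not from the paper. It checks that the leading singular value is 1 with right singular vector $\sqrt{P_X}$, and that the perturbation built from the second right singular vector sums to zero, so it is a valid direction on the simplex.

```python
import numpy as np

# Hypothetical channel matrix: column x holds the output distribution P_{Y|X=x}.
W = np.array([[0.8, 0.1, 0.2],
              [0.1, 0.7, 0.2],
              [0.1, 0.2, 0.6]])
P_X = np.array([0.5, 0.3, 0.2])   # assumed input marginal, strictly positive
P_Y = W @ P_X                     # induced output marginal

# DTM: B = [sqrt(P_Y)^-1] W [sqrt(P_X)]
B = np.diag(1.0 / np.sqrt(P_Y)) @ W @ np.diag(np.sqrt(P_X))

# NumPy returns singular values in descending order; rows of Vt are the
# right singular vectors (each determined only up to sign).
U_left, s, Vt = np.linalg.svd(B)
v0, v1 = Vt[0], Vt[1]
sigma = s[1]                      # second largest singular value

# Input perturbation J_u induced by the weighted perturbation L_u = v1.
J_opt = np.sqrt(P_X) * v1

print("singular values:", s)
print("|<v0, sqrt(P_X)>| =", abs(v0 @ np.sqrt(P_X)))
print("sum of J_opt =", J_opt.sum())
```

Under these assumptions, `sigma**2` is the largest ratio $\|B L_u\|^2 / \|L_u\|^2$ achievable by a valid perturbation.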

This can be visualized as in Figure 1(b), where the orthonormal bases for the input and output spaces are chosen according to the right and left singular vectors of $B$, respectively. The key point here is that while $I(U;X)$ measures how many bits of information are modulated in $X$, depending on how they are modulated, that is, on the direction of the corresponding perturbation vector, these bits have different levels of visibility at the $Y$ end. This is a quantitative way to show why viewing a channel as a bit-pipe carrying uniform bits is a bad idea.

Moreover, recall that the data processing inequality tells us that, for the Markov chain $U \to X \to Y$, the mutual informations satisfy $I(U;X) \ge I(U;Y)$. Let us assume that the second largest singular value of $B$ is $\sigma \le 1$; then the above derivations imply that $\sigma^2 \cdot I(U;X) \ge I(U;Y)$, up to $o(\epsilon^2)$ terms. Thus, we actually arrive at a result stronger than the data processing inequality, and equality can be achieved by setting the perturbation vector to be along the right singular vector of $B$ with the second largest singular value.
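The strengthened inequality can be checked numerically in the local regime. The sketch below assumes a binary uniform $U$ with $P_{X|U=u} = P_X \pm \epsilon J$, and uses a hypothetical channel `W` and marginal `P_X` (neither is from the paper). It compares the exact ratio $I(U;Y)/I(U;X)$ with $\sigma^2$ for the direction along the second singular vector and for an arbitrary other direction.

```python
import numpy as np

def kl(p, q):
    """K-L divergence D(p||q) in nats; assumes strictly positive entries."""
    return float(np.sum(p * np.log(p / q)))

# Hypothetical channel (columns are P_{Y|X=x}) and input marginal.
W = np.array([[0.8, 0.1, 0.2],
              [0.1, 0.7, 0.2],
              [0.1, 0.2, 0.6]])
P_X = np.array([0.5, 0.3, 0.2])
P_Y = W @ P_X

B = np.diag(1.0 / np.sqrt(P_Y)) @ W @ np.diag(np.sqrt(P_X))
_, s, Vt = np.linalg.svd(B)
sigma = s[1]

def coupling_ratio(J, eps=1e-3):
    """I(U;Y) / I(U;X) for binary uniform U with P_{X|U=u} = P_X +/- eps*J."""
    I_UX = 0.5 * kl(P_X + eps * J, P_X) + 0.5 * kl(P_X - eps * J, P_X)
    I_UY = (0.5 * kl(W @ (P_X + eps * J), P_Y)
            + 0.5 * kl(W @ (P_X - eps * J), P_Y))
    return I_UY / I_UX

J_best = np.sqrt(P_X) * Vt[1]          # along the second right singular vector
J_other = np.array([1.0, -1.0, 0.0])   # arbitrary zero-sum direction

ratio_best = coupling_ratio(J_best)
ratio_other = coupling_ratio(J_other)
print(sigma ** 2, ratio_best, ratio_other)
```

For small `eps`, `ratio_best` should be close to `sigma**2`, while other directions give a smaller ratio.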

The most important feature of the linear information coupling problem is that the single-letterization of (3) is simple. To illustrate the idea, we consider a 2-letter version of the point-to-point channel:

$$\max_{U \to X^2 \to Y^2:\ \frac{1}{2} I(U;X^2) \le \frac{1}{2}\epsilon^2} \ \frac{1}{2} I(U; Y^2). \tag{4}$$

Let $P_X$, $P_Y$, $W$, and $B$ be the input and output distributions, channel matrix, and the DTM, respectively, for the single-letter version of the problem. Then, the 2-letter problem has $P_X^{(2)} = P_X \otimes P_X$, $P_Y^{(2)} = P_Y \otimes P_Y$, and $W^{(2)} = W \otimes W$, where $\otimes$ denotes the Kronecker product. As a result, the new DTM is $B^{(2)} = B \otimes B$. We have the following lemma on the singular values and vectors of $B^{(2)}$.

Lemma 1. Let $v_i$ and $v_j$ denote two singular vectors of $B$ with singular values $\mu_i$ and $\mu_j$. Then $v_i \otimes v_j$ is a singular vector of $B^{(2)}$ and its singular value is $\mu_i \mu_j$.
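Lemma 1 can be sanity-checked numerically. The sketch below uses a hypothetical channel `W` and marginal `P_X` (assumptions for illustration only), with `np.kron` for the Kronecker product. It compares the singular values of $B \otimes B$ with all pairwise products $\mu_i \mu_j$ and verifies one singular pair explicitly.

```python
import numpy as np

def dtm(W, P_X):
    """Divergence transition matrix [sqrt(P_Y)^-1] W [sqrt(P_X)]."""
    P_Y = W @ P_X
    return np.diag(1.0 / np.sqrt(P_Y)) @ W @ np.diag(np.sqrt(P_X))

# Hypothetical single-letter channel (columns are P_{Y|X=x}) and marginal.
W = np.array([[0.8, 0.1, 0.2],
              [0.1, 0.7, 0.2],
              [0.1, 0.2, 0.6]])
P_X = np.array([0.5, 0.3, 0.2])

B = dtm(W, P_X)
B2 = np.kron(B, B)                       # 2-letter DTM

U, s, Vt = np.linalg.svd(B)
s2 = np.linalg.svd(B2, compute_uv=False)

# All pairwise products mu_i * mu_j, sorted in descending order.
products = np.sort(np.outer(s, s).ravel())[::-1]

# One pair checked explicitly:
# (B kron B)(v_i kron v_j) = mu_i * mu_j * (u_i kron u_j).
i, j = 0, 1
lhs = B2 @ np.kron(Vt[i], Vt[j])
rhs = s[i] * s[j] * np.kron(U[:, i], U[:, j])

print(s2[:3])   # leading value, then the tie discussed in the text
```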

Now, recall that the largest singular value of $B$ is $\mu_0 = 1$, with the singular vector $v_0 = [\sqrt{P_X(x)},\ x \in \mathcal{X}]^T$, which corresponds to the direction orthogonal to the distribution simplex. This implies that the largest singular value of $B^{(2)}$ is also 1, again corresponding to the direction that is orthogonal to all valid choices of the perturbation vectors.

The second largest singular value of $B^{(2)}$ is a tie between $\mu_0 \mu_1$ and $\mu_1 \mu_0$, with singular vectors $v_0 \otimes v_1$ and $v_1 \otimes v_0$. The optimal solution of (4) is thus to set the perturbation vectors to be along these two vectors. This can be written as

$$\begin{aligned} P_{X^2|U=u} &= P_X \otimes P_X + \left[\sqrt{P_X \otimes P_X}\right] (\epsilon\, v_0 \otimes v_1 + \epsilon'\, v_1 \otimes v_0) \\ &= \left(P_X + \epsilon' [\sqrt{P_X}]\, v_1\right) \otimes \left(P_X + \epsilon [\sqrt{P_X}]\, v_1\right) + O(\epsilon^2). \end{aligned}$$

Here, we use the fact that $v_0 = [\sqrt{P_X(x)},\ x \in \mathcal{X}]^T$. This means that the optimal conditional distribution $P_{X^2|U=u}$ for any $u$ has the product form, up to the first order approximation. With a simple time-sharing argument, it is easy to see that we can indeed set $\epsilon = \epsilon'$, that is, pick this conditional distribution to be i.i.d. over the two symbols, to achieve the optimum.
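The product-form claim can be illustrated with a short numerical sketch. It assumes a hypothetical channel `W` and marginal `P_X` (not from the paper) and takes $\epsilon = \epsilon'$. It builds the 2-letter perturbation along $v_0 \otimes v_1 + v_1 \otimes v_0$ and compares it with the i.i.d. product distribution; the gap should be of order $\epsilon^2$.

```python
import numpy as np

# Hypothetical single-letter channel (columns are P_{Y|X=x}) and marginal.
W = np.array([[0.8, 0.1, 0.2],
              [0.1, 0.7, 0.2],
              [0.1, 0.2, 0.6]])
P_X = np.array([0.5, 0.3, 0.2])
P_Y = W @ P_X
B = np.diag(1.0 / np.sqrt(P_Y)) @ W @ np.diag(np.sqrt(P_X))

_, s, Vt = np.linalg.svd(B)
v0 = np.sqrt(P_X)     # leading right singular vector, with its sign fixed
v1 = Vt[1]            # second right singular vector

eps = 1e-3
# Diagonal of [sqrt(P_X kron P_X)], applied elementwise below.
sqrt_joint = np.kron(np.sqrt(P_X), np.sqrt(P_X))

# 2-letter conditional distribution perturbed along v0 (x) v1 + v1 (x) v0.
joint = np.kron(P_X, P_X) + sqrt_joint * (eps * np.kron(v0, v1)
                                          + eps * np.kron(v1, v0))

# i.i.d. product of the single-letter perturbed distribution.
single = P_X + eps * np.sqrt(P_X) * v1
product = np.kron(single, single)

gap = float(np.max(np.abs(joint - product)))
print(gap, eps ** 2)
```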

The simplicity of this proof of the optimality of single-letter solutions is astonishing. All we have used is the fact that the singular vector of $B^{(2)}$ corresponding to the second largest singular value has a special form. A distribution in the neighborhood of $P_X \otimes P_X$ is a product distribution if and only if it can be written as a perturbation from $P_X \otimes P_X$ along the subspace spanned by the vectors $v_0 \otimes v_i$ and $v_j \otimes v_0$, in the form of $v_0 \otimes v + v' \otimes v_0$, for some $v$ and $v'$. Thus, all we need to do is to find the eigen-structure of the $B$-matrix, and verify whether the optimal solutions have this form. This procedure is used in more general problems.



One way to explain why the local approximation is useful is as follows. In general, the tradeoff between multiple K-L divergences (mutual informations) is a non-convex problem. Thus, finding the global optimum for such problems is in general intrinsically intractable. In contrast, with our local approximation, the K-L divergence becomes a quadratic function, and the tradeoff between quadratic functions remains quadratic. Effectively, our approach focuses on verifying the local optimality of the quadratic solutions, which is a natural thing to do, since the overall problem is not convex.

IV. The General Broadcast Channel 

Let us now apply our local approximation approach to the general broadcast channel. A general broadcast channel with input $X \in \mathcal{X}$ and outputs $Y_1 \in \mathcal{Y}_1$, $Y_2 \in \mathcal{Y}_2$ is specified by the memoryless channel matrices $W_1$ and $W_2$. These channel matrices specify the conditional distributions of the output signals at the two users, 1 and 2, as $W_i(y_i|x) = P_{Y_i|X}(y_i|x)$, for $i = 1, 2$. Let $M_1$, $M_2$, and $M_0$ be the two private messages and the common message, with rates $R_1$, $R_2$, and $R_0$, respectively. The multi-letter capacity region can be written as

$$\begin{cases} R_0 \le \frac{1}{n} \min\{I(U;Y_1^n),\ I(U;Y_2^n)\}, \\ R_1 \le \frac{1}{n} I(V_1; Y_1^n), \\ R_2 \le \frac{1}{n} I(V_2; Y_2^n), \end{cases}$$

for some mutually independent random variables $U$, $V_1$, and $V_2$, such that $(U, V_1, V_2) \to X^n \to (Y_1^n, Y_2^n)$ forms a Markov chain.

The linear information coupling problems of the private messages, given that the common message is decoded, are essentially the same as the point-to-point channel case. Thus, we only need to focus on the linear information coupling problem of the common message:

$$\begin{aligned} \max \quad & \frac{1}{n} \min\{I(U;Y_1^n),\ I(U;Y_2^n)\} \\ \text{subject to:} \quad & U \to X^n \to (Y_1^n, Y_2^n), \\ & \frac{1}{n} I(U;X^n) \le \frac{1}{2}\epsilon^2. \end{aligned} \tag{5}$$


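The core problem we want to address here is whether or not the single-letter solutions are optimal for (5). To do this, suppose that $P_{X^n|U=u} = P_X^{(n)} + \epsilon \cdot J_u$. Define the DTMs $B_i \triangleq [\sqrt{P_{Y_i}}^{-1}]\, W_i\, [\sqrt{P_X}]$, for $i = 1, 2$, and the scaled perturbation $L_u = [\sqrt{P_X^{(n)}}^{-1}] J_u$. The problem then becomes

$$\max_{L_u:\ \|L_u\|^2 = 1} \ \min\left\{ \|B_1^{(n)} L_u\|^2,\ \|B_2^{(n)} L_u\|^2 \right\}, \tag{6}$$

where $B_i^{(n)}$ is the $n$-th Kronecker product of the single-letter DTM $B_i$, for $i = 1, 2$.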

Different from the point-to-point problem, we need to choose the perturbation vectors $L_u$ to have large images simultaneously through two different linear systems. In general, the tradeoff between two SVD structures can be a rather messy problem. However, in this problem, for both $i = 1, 2$, $B_i^{(n)}$ has the special structure of being the Kronecker product of the single-letter DTMs. Furthermore, both $B_1$ and $B_2$ have the largest singular value of 1, corresponding to the same singular vector $v_0 = [\sqrt{P_X(x)},\ x \in \mathcal{X}]^T$, although the rest of their SVD structures are not specified. The following theorem characterizes the optimality of single-letter and finite-letter solutions for the general cases.

Theorem 1. Suppose that $B_i$ are DTMs for some DMC and input/output distributions, for $i = 1, 2, \ldots, k$. Then the linear information coupling problem

$$\max_{L_u:\ \|L_u\|^2 = 1} \ \min_{1 \le i \le k} \left\{ \|B_i^{(n)} L_u\|^2 \right\}, \tag{7}$$

has optimal single-letter solutions for the case with 2 receivers. In general, when there are $k > 2$ receivers, single-letter solutions cannot be optimal when the cardinality $|\mathcal{U}|$ is bounded by some function of $|\mathcal{X}|$. However, there still exist $k$-letter solutions that are optimal.

While we will not present the full proof of this result in this paper, it is worth pointing out how conceptually straightforward it is. We can write the right singular vectors of the two DTMs, $B_1$ and $B_2$, as $\phi_0, \phi_1, \ldots, \phi_{n-1}$ and $\psi_0, \psi_1, \ldots, \psi_{n-1}$. The only structure we have is that $\phi_0 = \psi_0 = [\sqrt{P_X(x)},\ x \in \mathcal{X}]^T$, both corresponding to the largest singular value of 1. For the other vectors, the relation between the two bases can be written as a unitary matrix $\Psi$, with $\phi_i = \sum_j \Psi_{ij} \psi_j$. Now, we can define an orthonormal basis for the space of multi-letter distributions on $\mathcal{X}^n$. For example, with 2-letter distributions, we can use $\phi_i \otimes \phi_j$, $i, j \in \{0, 1, \ldots, n-1\}$ and $(i,j) \neq (0,0)$. Note that any $L_u$ can be written as $L_u = \sum_{(i,j) \neq (0,0)} \alpha_{ij}\, \phi_i \otimes \phi_j$. If a perturbation vector $L_u$ has any non-zero component along $\phi_i \otimes \phi_j$ with $i, j \neq 0$, we can always move this component to either $\phi_i \otimes \phi_0$ or $\phi_0 \otimes \phi_j$ to have, say, $\phi_i \otimes \phi_0 = \sum_j \Psi_{ij}\, \psi_j \otimes \psi_0$. This results in larger norms of the output vectors through both channels. As a result, the optimizer of (6) can only have components on the vectors $\phi_0 \otimes \phi_j$ and $\phi_i \otimes \phi_0$. This means that the resulting conditional distributions must be product distributions, i.e., $P_{X^n|U=u} = P_{X_1|U=u} \cdot P_{X_2|U=u} \cdots$. This simple observation greatly simplifies the multi-letter optimization problem: instead of searching over general joint distributions, we now have the further constraint of conditional independence. This directly gives rise to the proof of the optimality of i.i.d. distributions, and hence of single-letter solutions, for the 2-user case, which is the first definitive answer on the general broadcast channels.

The more interesting case is when there are more than 2 receivers. In such cases, i.i.d. distributions simply do not have enough degrees of freedom to be optimal in the tradeoff of more than 2 linear systems. Instead, one has to design multi-letter product distributions to achieve the optimum. The following example, constructed with the geometric method, illustrates the key ideas.
Example:

We consider a 3-user broadcast channel. Let the input alphabet $\mathcal{X}$ be ternary, so that the perturbation vectors have 2 dimensions and can be easily visualized. Suppose that the three DTMs are rotations of $0$, $2\pi/3$, $4\pi/3$, respectively, followed (left multiplied) by the projection to the horizontal axis. This corresponds to the ternary input channels shown in Figure 2(a). Now if we use single-letter inputs, it can be seen that for any $L_u$ with $\|L_u\|^2 = 1$, $\min\{\|B_1 L_u\|^2, \|B_2 L_u\|^2, \|B_3 L_u\|^2\} \le 1/4$. The problem here is that no matter what direction $L_u$ takes, the three output norms are unequal, and the minimum one always limits the performance. Now, if we use 3-letter inputs, and denote $\phi_\theta = [\cos\theta, \sin\theta]^T$, then we can take

$$P_{X^3|U=u} = (P_X + \epsilon \phi_\theta) \otimes (P_X + \epsilon \phi_{\theta + \frac{2\pi}{3}}) \otimes (P_X + \epsilon \phi_{\theta + \frac{4\pi}{3}})$$

for any value of $\theta$, as shown in Figure 2(b). Intuitively, this input equalizes the three channels, and gives, for all $i = 1, 2, 3$, $\|B_i^{(3)} L_u^{(3)}\|^2 = 1/2$, which doubles the information coupling rate. Translating this solution to the coding language, it means that we take turns feeding the common information to each individual user. Note that the solution is not a standard time-sharing input, and hence the performance is strictly outside the convex hull of i.i.d. solutions. One can interpret this input as a repetition of the common message over three time slots, where the information is modulated along equally rotated vectors. For this reason, we call this example the "windmill" channel. Additionally, it is easy to see that the construction of the windmill channel can be generalized to the cases of $k > 3$ receivers, where $k$-letter solutions are necessary.

Fig. 2. (a) A ternary input broadcast channel. (b) The optimal perturbations over 3 time slots.

Note that in this example, we let $U$ be a binary random variable, and in this case, while there are optimal 3-letter solutions, optimal single-letter solutions do not exist. However, one can in fact take $U$ to be non-binary. For example, let $\mathcal{U} = \{0, 1, 2, 3, 4, 5\}$ with $P_U(u) = 1/6$ for all $u$, and let $L_{u=0} = -L_{u=1} = \phi_\theta$, $L_{u=2} = -L_{u=3} = \phi_{\theta + \frac{2\pi}{3}}$, and $L_{u=4} = -L_{u=5} = \phi_{\theta + \frac{4\pi}{3}}$; then we can still achieve the information coupling rate $1/2$. Thus, there actually exists an optimal single-letter solution with cardinality $|\mathcal{U}| = 6$. However, when there are $k$ receivers, it requires cardinality $|\mathcal{U}| = 2k$ to obtain optimal single-letter solutions. Essentially, this example shows that finding a single perturbation vector with a large image at the outputs of all 3 channels is difficult. The tension between these 3 linear systems requires more degrees of freedom in choosing the perturbations, or in other words, in the way the common information is modulated. These additional degrees of freedom can be provided either by using multi-letter solutions or by allowing larger cardinality bounds. This effect is not captured by the conventional single-letterization approach.
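The windmill example can be checked with a few lines of Python. This is only a sketch in reduced coordinates: it assumes each DTM acts on the 2-dimensional space of valid perturbations as a rotation followed by a projection, and that the per-letter gain of a product input is the average of the per-slot squared output norms (the slot components occupy different tensor positions, so their squared norms add). All function names are hypothetical.

```python
import numpy as np

ANGLES = [0.0, 2 * np.pi / 3, 4 * np.pi / 3]

def rot(a):
    """2x2 rotation matrix by angle a."""
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

PROJ = np.array([[1.0, 0.0]])            # projection onto the horizontal axis
B = [PROJ @ rot(a) for a in ANGLES]      # reduced DTMs on 2-D perturbations

def phi(theta):
    """Unit perturbation vector [cos(theta), sin(theta)]."""
    return np.array([np.cos(theta), np.sin(theta)])

def gain(Bi, L):
    """Squared output norm ||B_i L||^2."""
    return float(np.sum((Bi @ L) ** 2))

def single_letter(theta):
    """min_i ||B_i phi_theta||^2 for one unit-norm perturbation."""
    return min(gain(Bi, phi(theta)) for Bi in B)

def three_letter(theta):
    """Per-letter gain at each receiver for the rotated 3-letter input."""
    slots = [phi(theta + a) for a in ANGLES]
    return [sum(gain(Bi, L) for L in slots) / 3 for Bi in B]

def six_valued_u(theta):
    """Single-letter gain with |U| = 6, uniform P_U, and L_u = +/- phi."""
    Ls = [sgn * phi(theta + a) for a in ANGLES for sgn in (1.0, -1.0)]
    return [sum(gain(Bi, L) for L in Ls) / 6 for Bi in B]

thetas = np.linspace(0.0, np.pi, 3601)
best_single = max(single_letter(t) for t in thetas)
print(best_single, three_letter(0.3), six_valued_u(0.3))
```

Under these assumptions, the best binary single-letter choice is limited to about 1/4, while both the 3-letter input and the six-valued $U$ give 1/2 at every receiver, for any $\theta$.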

Theorem 1 removes most of the difficulty of solving the multi-letter optimization problem (5). What remains is to find the optimal scaled perturbation $L_u$ for the single-letter version of (5) if the number of receivers is $k = 2$, or for the $k$-letter version if $k > 2$. These are finite dimensional convex optimization problems [8], which can be readily solved.

We can see that all these information theory problems are solved with essentially the same procedure, and that all we need in solving them is simple linear algebra. This again demonstrates the simplicity and uniformity of our approach in dealing with information theory problems.

V. Conclusion 

In this paper, we present the local approximation approach, and show that with this approach, we can handle the issue of single-letterization in information theory by just solving simple linear algebra problems. Moreover, we demonstrate that our approach can be applied to different communication problems with the same procedure, which is a very attractive property. Finally, we provide geometric insight into the optimal finite-letter solutions for sending the common message to $k > 2$ receivers, which also explains why optimal single-letter solutions fail to exist in these cases.